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Abstract 

We introduce and evaluate data analysis methods to interpret simultaneous measurement of multiple genomic 
features made on the same biological samples. Our tools use gene sets to provide an interpretable common scale 
for diverse genomic information. We show we can detect genetic effects, although they may act through different 
mechanisms in different samples, and show we can discover and validate important disease-related gene sets that 
would not be discovered by analyzing each data type individually. 



Background 

The increasing affordability of high throughput genome- 
wide assays is enabling the simultaneous measurement 
of several genomic features in the same biological sam- 
ples. Cancer genome projects have been at the forefront 
of this trend, and have faced the challenge of integrating 
these diverse data types [1,2], including RNA transcrip- 
tional levels, genotype variation, DNA copy number var- 
iation, and epigenetic marks. Annotated collections of 
gene sets, capturing established knowledge about biolo- 
gical processes and pathways, have proven an essential 
tool for integration. Examples of these sets include chro- 
mosomal locations, signaling and metabolic pathways, 
transcriptional programs, and targets of specific tran- 
scription factors. Because one can make inferences 
about the importance of a given gene set using several 
different genomic data types, gene set analysis provides 
a direct and biologically motivated approach to analyz- 
ing these data types in an integrated way. A widely used 
public collection of gene sets is the Molecular Signa- 
tures Database (MSigDb) [3]. A comprehensive list of 
conventional tools for gene set analysis for a single data 
type is given in Ackermann et al. [4]. Many of these 
approaches are implemented in the extensively used sta- 
tistical computing environment R/Bioconductor [5]. 

The gene set perspective makes sense both biologically 
and statistically. First, small differences in the functions 
of multiple genes in the same set may not be detectable 
at the single gene level, but can add to create larger dif- 
ferences at the gene set level. This increases the power 
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for detecting real biological differences. Second, a single 
hit on a given pathway may be sufficient to generate a 
phenotypic difference. If this hit can occur in any of sev- 
eral components in the pathway, individuals with the 
same phenotype may show variability in the specific 
genes that are hit, but show a more consistent pattern 
at the pathway or gene set level [1,6]. Importantly, even 
when a difference at the single gene level can be 
detected, its biological importance may depend on the 
states of other interacting genes and gene products. 

Cancer genomes contain point mutations, insertions, 
deletions, translocations, methylation abnormalities, and 
copy number (CN) and expression changes not seen in 
normal tissues. In some cancers, such as glioblastoma 
multiforme (GBM), different genes involved in pathways 
involving TP53, phosphoinositide 3-kinase (PI3K), and 
RBI are altered in different patients, and, importantly, 
these might be altered via different mechanisms [1], 
such as point mutations and CN changes. Therefore, 
taking into account multiple data types should improve 
our ability to detect gene sets associated with a 
phenotype. 

In recent large-scale cancer genome studies [1,6,7] 
preliminary integration approaches have been success- 
fully applied; however, these approaches have been tai- 
lored to specific contexts. A general, a scalable and 
rigorous statistical framework has not yet been devel- 
oped. In this article, our goal is to fill this gap. To this 
end, we introduce, compare, and systematically evaluate 
two alternative set-based data integration approaches. 
The first approach is based on computing model-based 
gene-to-phenotype association scores for each gene 
using all data types together, followed by gene set analy- 
sis of these scores. We term this the integrative 
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approach. The second is to perform separate conven- 
tional gene set analyses for each data type, and then 
derive a consensus significance score using a meta-ana- 
lytical approach. 

Results 

Overview 

We present both novel data analyses and controlled 
simulations. First, we jointly examine gene expression 
and CN variation data about glioblastoma multiforme 
tumors from The Cancer Genome Atlas (TCGA) [2], 
and detect differences in the Wnt, glycolysis and stress 
pathways that appear relevant to differences between 
short- and long-term survivors. We also validate these 
findings using independent samples from the NCI Repo- 
sitory for Molecular Brain Neoplasia Data (Rembrandt) 
[8]. To provide a rigorous counterpart to these results 
we perform extensive simulations. These show that the 
integrative approach does enable the discovery of dis- 
ease-related gene sets that would not be discovered 
when each data type is analyzed individually using cur- 
rent approaches. Discoveries remain reliable also when 
several features are highly noisy. 

The Cancer Genome Atlas glioblastoma multiforme study 

We consider TCGA glioblastoma data [2] of four types: 
two gene expression measurements (El, E2) and two 
CN measurements (CI, C2), described in Materials and 
methods. To discover gene sets important in GBM sur- 
vival we use an extreme discordant phenotype design 
[9] with a total of 95 subjects. GBM patients with a sur- 
vival time shorter than the lower quartile (190 days) are 
labeled short-term survivors (STSs), and those with a 
survival time longer than the upper quartile (594 days) 
long-term survivors (LTSs). Such grouping enhances sig- 
nal relevant to survival. We used gene sets from the 
MSigDb canonical pathways. 

First, we consider genes that are measured in all data 
types (genes that are measured only in a subset of plat- 
forms are filtered out), and use a competitive gene set 
test (see Materials and methods), comparing genes within 
a set to the remainder of the annotated genes. The 30 top 
sets discovered by the integrative approach are reported 
in Table 1. If we consider the top 30 sets, we discover 12 
gene sets that are not discovered by any of the standard 
single-data-type analyses. The majority of these sets are 
related to metabolic processes. Six are involved in sugar- 
related metabolic processes and energy production, and 
two (the curated streptomycin biosynthesis pathway, and 
its KEGG (Kyoto Encyclopedia of Genes and Genomes) 
counterpart, hsa00521) are identified as a result of genes 
shared with the sugar metabolism group (six out of eight 
genes in the streptomycin biosynthesis set are paralogs of 
genes in the glycolysis pathway). 



This metabolic shift toward sugar metabolism is not 
surprising since it is known that cancer cells in general 
[10,11], and glioblastoma cells in particular [11], depend 
on the conversion of glucose to lactate in the presence 
of oxygen (Warburg effect [12]). It has also been shown 
that shutdown or down-regulation of the glycolysis 
pathway in glioblastoma is associated with cell death 
[13,14]. 

We find that mean measurements for glycolytic genes 
are, on average, larger in the STS compared to the LTS 
phenotype (Figure 1) and that there are more gene 
copies in STSs. Since reduced glycolysis (and sugar 
usage) promotes GBM cell death, we speculate that 
there might be an association between patient survival 
and efficient sugar metabolism, which is being detected 
by the integrative approach but missed by conventional 
analysis of each data type separately. 

Necrosis and hypoxia are pathognomonic features of 
the highest-grade malignant gliomas and are thought to 
play a key role in the aggressive behavior of GBM, 
including invasiveness and chemo-resistance, through a 
variety of mechanisms [15-17]. The induction of the gly- 
colytic pathway we have documented in STS patients is 
likely to represent an adaptive consequence of hypoxic 
conditions, mediated by genomic alteration and/or 
expression of hypoxia inducible factors (HIF1A and 
HIF2A), which have been shown to induce glycolytic 
genes [18], and recently to play a fundamental role in 
the expansion and maintenance of the GBM stem cell 
compartment [19,20]. 

The other metabolic processes related to gene sets 
identified in our analysis are riboflavin (vitamin B2) 
metabolism and the biosynthesis of glycosphingolipid. 
The involvement of the riboflavin pathways appears to 
be mostly driven by up-regulation of members of the 
myotubularin-related protein family (not shown), which 
act as phosphatases modifying cell membrane phospho- 
lipids. From this perspective, the concomitant enrich- 
ment of 'biosynthesis of glycosphingolipid', mostly 
determined by gene down-regulation at both the CN 
and expression level (not shown), may relate to an early 
observation that membrane lipid modifications occur 
during progression of human gliomas [21], and that gly- 
cosphingolipid profiles correlate with survival grading in 
human gliomas [22]. Intriguingly, a crucial role for the 
PI3K/AKT pathway in the regulation of lipid biosynth- 
esis and signaling pathways was recently reported [23], 
linking our findings to the major molecular alterations 
in the PI3K pathway in GBM identified in TCGA [2]. 

Among the non-metabolic gene sets detected in our 
analysis, we highlight those associated with the stress and 
Wnt pathways. The stress pathway includes genes 
involved in TNF signaling through its receptors TNFR1 
and TNFR2. Cellular responses to TNF encompass a 
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Table 1 P-values for top 30 gene sets discovered by the integrative method using the competitive gene sets test 
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wide range of processes, from induction of cell survival to 
apoptosis. The final outcome results from the modula- 
tion, integration and cross-talk of distinct signaling cas- 
cades that are initiated by TRADD (TNFRl-associated 
DEATH domain protein) and TRAF2 (TNF receptor- 
associated factor 2) [24,25]. The presence of this gene set 
in our analysis is mostly driven by increased expression/ 
CN of pathway members in the STS phenotype (Figure 
1). Factors involved in both survival and apoptosis are 
increased (that is, mitogen-activated protein kinase 
(MAPK) signaling pathway genes NFKB1, TRADD, 
CRADD (CASP2 and RIPK1 domain containing adaptor 
with death domain)). The only two genes with reduced 
expression in the STS group are those encoding TNF, 
which initiates the signaling, and MAPK8, which is 
required for TNF-alpha-induced apoptosis [26]. 

Although extensive evidence has been published 
describing the Wnt pathway's role in embryonic 



development, adult tissue homeostasis, and human dis- 
ease, including cancer [27], little is know about the role 
of the Wnt pathway in GBM. However, recent findings 
have shown that promoter hypermethylation of Wnt 
pathway inhibitors occurs in GBM [28]. In our analysis, 
the relationship of the Wnt pathway to survival in GBM 
patients is driven by both increased and decreased 
expression/CN in the STS group (Figure 1) of genes 
encoding both inhibitors and activators of the Wnt 
pathway. 'Up-regulated' genes include central players of 
the pathway, specifically CTNNB1 (encoding p-catenin) 
and GSK3B (glycogen synthase kinase 3 beta). GSK3 
phosphorylates the APC/AXIN1/CTNNB1 complex, and 
thus targets P-catenin for degradation. Wnt signaling 
activation determines GSK3 inhibition status, resulting 
in P-catenin stabilization, nuclear transfer and transcrip- 
tion activation. These results agree with the recent 
observation that GSK3 inhibition results in glioma cell 
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death through a mechanism that depends on oMYC 
activation, decreased NF-^B activity, and an alteration 
of intracellular glucose metabolism [29]. Even more 
interesting is that increases of GSK3 and NFKB1 in the 
STS group, at both the CN and expression level, are 
accompanied by decreased MYC levels (Figure 1). 

To assess the sensitivity of our results to the choice of 
patients, we considered tertiles instead of quartiles, and 
we used a gene-level Cox regression model on the entire 
patient set. The results were very similar to those pre- 
sented above (Tables SI and S2 in Additional file 1). 

Our gene-to-phenotype association scores are based on 
the difference of the deviances in the gene-level regression 
model. This metric depends on the number of variables 
included in the model. As the number of variables 
increases, the difference of the deviances will grow, even if 
the added variables are not truly correlated with the phe- 
notype, and thus do not provide any additional biological 
signal. Therefore, competitive gene set tests cannot be 
used to analyze genes that are not measured for all data 
types because the genes that are measured on fewer plat- 
forms will get inferior rankings when compared to genes 
that might have the same strength of biological signals but 
are measured everywhere. However, restricting attention 
to the genes measured in all data types might lead to loss 
of some interesting biological information. We extended 
our analysis to the union of genes measured in at least one 
data type. To do this without biasing the results in favor of 
genes represented in multiple data types, we use a self- 
contained gene set test (see Materials and methods), com- 
paring genes within each set to a null distribution based 
on those genes only. This test compares the observed data 
to an internal control based on the null distribution for 
the same set of genes: thus, the values for each gene under 
the null hypothesis account for the number of data types 
for which the gene is available, and the effect of the num- 
ber of platforms on the association scores is properly con- 
trolled. The results are given in Table 2. The top sets 
share pathways with the competitive analysis, including 
sugar metabolic processes. Interestingly, the second most 
significant pathway (HSA04010_MAPK_SIGNALING) 
contains all the genes from the STRESSPATHWAY 
reported above. Smaller P-values and rearrangements in 
the list of top sets for the self-contained test, compared to 
the competitive one, are likely to result mostly from the 
different statistical meaning of the test (the two proce- 
dures test different null hypotheses). 

Independent validation 

We validated results by applying the same method to an 
independent set of glioblastoma samples from the 
Rembrandt database [8]. Because we could only acquire 
information on a relatively limited number of genes, we 
focused on validation of the top 30 sets emerging from 



the self-contained analysis. Despite smaller sample sizes, 
missing genes, and the availability of only two data 
types, the vast majority of the pathways discovered in 
TCGA show strong evidence of association with survival 
in our validation set (Table 2), and the directions of 
association are generally confirmed. 

Simulations 

To generate data as realistic as possible, we began with 
the actual TCGA GBM data just described, reassigned 
phenotype labels at random, spiked in gene set differ- 
ences, and asked each method to recover the set that 
had been spiked in. We used both synthetic non-over- 
lapping gene sets and real chromosomal bands to cap- 
ture both low and high within-set correlations. In each 
spike-in experiment we set a fraction y of genes to be 
truly associated with the phenotype within a spiked-in 
set, and a strength p of the effect of the gene on the 
phenotype. We varied y and p to generate alternative 
scenarios (see Materials and methods for details). 
Synthetic gene sets 

Our synthetic gene set collection mimics the size distribu- 
tion of sets from the MSigDb canonical pathway collection 
[3]. Genes are assigned to sets at random, but so that sets 
in the collection do not share genes. Therefore, we do not 
expect two genes within a set to be more strongly corre- 
lated than any two genes not belonging to the same set. 

We perform gene set analysis on all data types sepa- 
rately, using a standard methodology implemented in 
the limma package in Bioconductor [5,30]. We compare 
these to three multi-platform methods. For each method 
we rank sets by P-value, and select the top ten. We eval- 
uate approaches by comparing the number of true posi- 
tive hits among the top ten sets (Figure 2a). Results 
from the four individual data types are practically indis- 
tinguishable. All three multiple-data-type approaches 
outperform the single-data-type approaches. The inte- 
grative approach is significantly better for small values 
of y, where subsets of altered genes are likely to be dif- 
ferent across data types. In such settings the sensitivity 
of single-data-type methods will be relatively low, but 
the integrative approach enjoys increased sensitivity 
because the integrated gene-to-phenotype association 
score is sensitive to gene alterations in a single data 
type. This property is especially useful when genes in a 
certain set are altered by different biological mechan- 
isms, and consequently measured via different data 
types. Both meta-analytical approaches perform simi- 
larly, and are outperformed by the integrative approach. 
This pattern of performance is similar for other values 
of p. All methods, as expected, perform better as p 
increases (not shown). 

Figure 2b shows receiver operating characteristic 
(ROC) [31] curves from the same simulations to provide 
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Table 2 P-values for top 30 gene sets discovered by the integrative method using the self-contained gene sets test 
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Asterisks indicate sets related to sugar metabolic processes. E1 and E2, single data type analysis using expression data; C1 and C2, single data type analysis using 
copy number data; INT, integrative method. 



results that are independent of list size. The integrative 
approach tends to have the highest sensitivity and speci- 
ficity across all P-value cutoffs. Accuracy results are 
summarized using areas under the ROC curves in Table 
3. Importantly, the integrative approach generally shows 
less variability when compared to both single-data-type 
approaches and meta-analysis. 
Chromosome bands 

Our next gene set collection is defined by the chromo- 
somal location of genes, and constitutes a partition of 
the genes measured. Here we expect mildly increased 
correlations within the sets for gene expression, as well 
as strong spatial correlations within sets, and between 
physically neighboring sets for the CN. 

The performance of single-data-type approaches dif- 
fers between expression and CN (Figure 2a). The strong 
spatial correlations between neighboring bands are 
inherited from the data that we use to construct our 



simulations. In the GBM tumor samples, amplifications 
and deletions tend to be large, sometimes spanning sev- 
eral chromosome bands, and are present only in subsets 
of patients. Because of that, the variability of the Wil- 
coxon test ranks for CN data tends to be much lower 
for neighboring bands than for randomly selected ones. 
Therefore, false positive calls for CN data tend to cluster 
by their ranks and chromosomal position. This explains 
the inferior performance when compared to expression 
data, which are not affected as much by spatial correla- 
tions. In contrast, mild correlations within the sets for 
expression data tend to aid discovery of the spiked-in 
sets. Both expression data types perform better for the 
chromosome band collection than for the synthetic sets, 
where genes within sets were not correlated. Since both 
CN data types tend to produce a large amount of false 
positive calls in the top ten list, the performance of inte- 
gration methods may also deteriorate. The most affected 
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Figure 2 Simulation results, (a) Average number of discovered spiked-in sets among the top ten sets that are inferred to be enriched for 
genes discriminating between two phenotypes, against the expected fraction of the altered genes; (3 = 0.5. (b) Average ROC curves; (3 = 0.5, y = 
0.1. INT, integrative method. 



method is meta-analysis by minimum P-value approach 
because when there is a disagreement between data 
types, the strongest signal is not necessarily correct. 
Averaging P- values leads to a better, but still unsatisfac- 
tory, performance. The integrative approach is affected 
the least by the poor performance of the autocorrelated 
CN data. It still performs best for small and intermedi- 
ate values of y, which are the most common. In practice, 
we may not know which data type is best at capturing 



the signal, in which case using an integrative approach 
provides a robust safe analytical plan. The ROC curves 
(Figure 2b) show that as a higher sensitivity is sought, 
the integration method still retains a substantial edge 
over the single-data-type methods. The performance of 
the meta-analytical methods improves and eventually 
approaches that of individual expression data, though it 
remains inferior to integration. Average areas under 
ROC curves are in given in Table 3. 
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Table 3 Areas under the ROC curves for various simulation scenarios 
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MSigDb canonical pathways 

The MSigDb canonical pathway collection differs from 
the previous two in that it is not a partition of genes. 
Thus, we expect to observe correlations within the sets, 
for biological reasons such as co-regulation, and 
between the sets, because they share common genes. In 
terms of our simulation study, an important issue is 
how to define a true positive call. When we spike-in 
preselected sets, we may also alter other sets that con- 
tain genes that were chosen to be associated with the 
phenotype. Recovering such unintentional spike-in sets 
should not necessarily be considered as a false positive. 
To account for this, we consider a discovered set to be 
a true positive if it shares at least 50% of the genes with 
the union of genes from the original spike-in sets. 

All methods perform worse than before. This is par- 
tially explained by the difficulty of defining a true 



positive. Multi-data-type methods provide gains in per- 
formance of various magnitudes for almost all scenarios. 
For all p, with y > 0.9, the average P- value method mar- 
ginally outperforms the integrative approach. The varia- 
bility of the accuracy of the methods also slightly 
increases compared to the previous scenarios (Table 3). 

Novel discoveries 

Individual data types rank sets differently, discovering 
different sets for a given list size. Figure 3a shows the 
fraction of sets that are exclusively discovered by each 
data type. Figure 3b shows the additional sets discovered 
by the integrative and meta-analytical approaches but 
not by any of the single-data-type analyses (see Materi- 
als and methods for details). The meta-analytical 
approaches show minimal improvement over single- 
data-type analysis, while the integrative method 
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Figure 3 True positive exclusively discovered sets. (a,b) The average fraction of true positive exclusively discovered sets by each of 
traditional one-dimensional analysis (EF in Materials and methods section) (a) and integrative and meta-analytical methods (EF* in Materials and 
methods section) (b). (3 = 0.5, y = 0.1. INT, integrative method. 



discovers a large fraction of sets that are missed by all of 
the one-dimensional analyses. Statistically, this means 
that there is a large increase in the power to detect true 
effects for a given list size. In general, it cannot be guar- 
anteed that a set identified by a single-data-type analysis 
will be necessarily identified using a model-based inte- 
grative approach. Such a property applies to the mini- 
mum P-value meta-analytical approach only when 
significance is held constant. However, our analysis of 



the ROC curves shows that our model-based integration 
approach has the most favorable combination of sensi- 
tivity and specificity. 

Discussion 

We have developed and compared general approaches 
and specific tools to integrate data from genomic studies 
that measure multiple genomic features in the same 
subjects. Using simulations and data analysis, we 
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demonstrate how integration can discover important 
patterns that would otherwise be missed. 

Our methodology provides the greatest advantages 
when genes within a set are altered by different mechan- 
isms. Consider a pathway of ten genes, where one is 
amplified, another is mutated, and a third is overex- 
pressed. Single-data-type analysis and meta-analysis are 
likely to miss this pathway, while an integrative 
approach is poised to detect it. This does not come at 
the price of overemphasizing the pathway's importance 
when there is redundancy in the signal. For example, 
amplification and increased expression of a gene are 
likely to be correlated. The integrative approach recog- 
nizes that this is a single biological signal. Both expres- 
sion and CN enter a single model for that gene, and the 
two measurements will only do marginally better as a 
pair than they would individually if they are highly 
correlated. 

Our approach will adjust for varying reliability across 
data types: if the noise in a data type doubles, this data 
type would automatically contribute less to each gene- 
specific regression, and thus to the final result. Our 
approach could be easily modified to allow users to 
weight certain data types more heavily: for example, the 
gene-specific regression could be estimated using Baye- 
sian methods, where the information on the biological 
importance of the data types is incorporated into prior 
distributions for the coefficients. 

Application to other phenotypes and inclusion of cov- 
ariates are also straightforward, by suitably choosing the 
type of regression used and the variables included. Cox 
regression could be used for survival data, linear regres- 
sion for continuous phenotypes. Sliced inverse regres- 
sion [32] could also prove useful, since it can be directly 
applied to both categorical and continuous phenotypes. 
Importantly, our approach entails no additional price in 
terms of multiple testing compared to one-dimensional 
analysis. 

We operate within a two-stage framework. There are 
approaches for gene set analysis in other contexts that 
operate directly on the raw data - for example, on 
expression levels [33] or mutation counts [34]. These 
approaches have the advantage of allowing for a fuller 
treatment of uncertainty, some of which is lost by treat- 
ing the regression-based scores as data. However, inte- 
gration across platforms is greatly simplified by 
operating on gene summaries. Our initial assumption 
that each gene is measured once for each data type may 
be overcome by adding as many terms to the linear pre- 
dictor as there are measurements for the gene in ques- 
tion, though additional investigation of this issue is 
needed. Other approaches achieve integration by build- 
ing a network that exploit existing knowledge on the 
biological interactions within a pathway [35]. When 



available, this can be very valuable information. On the 
other hand, approaches considered here apply to far lar- 
ger collections of gene sets - for example, positional 
gene sets or collections of previously described prognos- 
tic gene signatures - whose genes do not necessarily 
have direct functional relationships. 

The integrative approach is not readily extensible to 
joint analysis of data when each data type is measured 
in a different set of patients. However, for these studies, 
meta-analytical approaches remain available. In our 
simulations, spiked-in genes are selected independently 
for each data type; therefore, the comparisons between 
meta-analytical approaches and single platform analyses 
are also applicable to situations where different data 
types are measured in different subjects. Simulation 
results show that the meta-analytical approach provides 
improvements over the analysis performed using a single 
data type, unless several data types present very strong 
false positive results. 

Lastly, application of the integrative approach to two 
independent glioblastoma datasets has allowed us to dis- 
cover and validate several gene sets associated with sur- 
vival. These sets would not receive strong support when 
expression and CN data are analyzed separately. The 
role of some of the gene sets found with our approach 
in GBM survival, such as pathways involved in sugar 
metabolism, is strongly supported by recent literature. 
Our suggestion of the previously unreported involve- 
ment of the Wnt pathway in GBM survival provides 
motivation for further, more in-depth biological 
investigation. 

Materials and methods 

Statistical approaches for multi-platform analysis 
Notation and problem definition 

We begin by defining notation for the single data type 
case. The data consist of a matrix of genomic measure- 
ments X and a vector of phenotypes Y. The genomic 
measurements are cross-referenced to a membership 
matrix M whose generic element m gs is the indicator of 
whether gene g is in set 5. Many current gene set analy- 
sis methods proceed in two stages [4,36-38]: gene-level 
testing of differences between groups and set-level test- 
ing of differences in scores. 

In stage I (gene-level testing of differences between 
groups), for each gene g, compute score s g (X,Y), captur- 
ing the relationship between genomic measurements 
and a phenotype. 

In stage II (set-level testing of differences in scores), 
using scores computed in stage I as data, look for associa- 
tion between scores and columns of M. For example, by 
testing whether the distribution of scores for genes in set 
s is different from the distribution of scores for a refer- 
ence group, say the set of all measured genes that are not 



Tyekucheva et al. Genome Biology 201 1, 12:R105 
http://genomebiology.com/201 1 /1 2/1 0/R1 05 



Page 11 of 14 



in the set, via a two-sample statistic t s (s,M s ) - the compe- 
titive test; or by comparing the observed distribution of s g 
{X,Y),ge 5, to its null distribution, when genes are not 
associated with the phenotype - the self-contained test. 

Generalizing the setting of one-dimensional gene set 
analysis, we assume that we have D dimensions to the 
genomic investigation. Each dimension is a data type - 
for example, transcript levels from expression arrays, 
CN data, somatic mutations, methylation data. We 
assume that the dimensions are available for the same 
samples, and that the data provided in each dimension 
are already summarized by gene. 

Data are a series of D matrices of sample-specific 
genomic measurements X 1 ,...^ 0 and a vector/matrix of 
phenotypes Y. These measurements are cross-referenced 
to a single membership matrix M, as earlier. 
Integrative approach 

Heterogeneous data are integrated into a single gene- 
specific score, followed by one-dimensional gene set 
analysis. Formally, stage I consists of evaluating a score 
s g {X 1 1 K 1 X D ,Y) that draws from all the measurements 
available from gene g across all the dimensions studied. 

A relatively simple and general approach is to fit, for 
every gene, a linear or generalized linear model of the 
form: 

ME{Y i \Xl i ,X 2 gi ,...,X°))= J2 X Ut 

de{l...D} 

where 0 is a link function [39] and i the biological 
sample. For each gene, the stage I score can be a mea- 
sure of the overall fit of the model, for example, a likeli- 
hood ratio for comparing this model to the 'null' model 

in which all the coefficients are zero. In stage II 

these scores can then be analyzed using traditional 
methods to obtain set-specific scores £ 5 (s,M 5 ). 
Meta-analytical approach 

This approach starts as a standard one-dimensional gene 
set analysis: we determine gene-to-phenotype association 
scores separately for each dimension and generate a 

matrix of scores whose generic element is Sg(X d ,Y), 

and in stage II compute set-specific scores tf(s,M s ),d e 
1,K, £>, for each dimension. Next these scores can be 
integrated, say, by averaging: 

ts{s,Ms)= avg t d s {sMs)r 

de{l,...,D} 

when evidence of significance from several data types 
is needed, or by taking an extremum: 

t s (s,M s ) = extremumtf(s,M s ), 



when strong evidence from a single dimension is 
sought. 

Implementations 
Single data type analysis 

For each gene we compute the gene-to-phenotype asso- 
ciation score as the difference of deviances of the logis- 
tic regression models with and without the genomic 
measurement as the single predictor. This score is used 
in testing for gene sets that are enriched for genes dif- 
ferent across phenotypes [4]. 
Integrative approach 

For each gene, observations from all available data types 
are used as independent variables in a multivariate logis- 
tic regression model. The integrated gene-to-phenotype 
association score is computed as the difference of the 
deviances of the null model and the model with all pre- 
dictors, and used in gene set tests. We denote this 
implementation of the integrative approach by INT. 
Meta-analytical approach 

We use geometrically averaged P- values, AvgP [40], and 
minimum P- values, MinP [41], from the single-data-type 
gene set tests. 
Gene set tests 

We use the Mann- Whitney test, as implemented in the 
R/Bioconductor package limma [30], for competitive 
analysis, and the signed rank Wilcoxon test for self-con- 
tained analyses. For the latter we generate the null dis- 
tribution of gene-to-phenotype association scores by 
permuting phenotype labels. We perform the test in 500 
permutations, comparing the observed scores and null, 
and then average the resulting test statistics; the P- 
values are obtained from the critical values table of the 
normal approximation to the Wilcoxon signed rank sta- 
tistic. We compare discoveries across methods by 
choosing the same P-value cutoff. The P-values would 
remain comparable across methods if one applied a 
multiple comparison adjustment, as the number of com- 
parisons is the same in the integrative as in each of the 
single-data-type approaches. While there are many 
choices for the scores and gene set tests, the simple pro- 
cedures we chose to implement perform very competi- 
tively [4] and adequately represent the general gene set 
enrichment approach in the context of our proposed 
integration framework. 

The Cancer Genome Atlas data 

We consider TCGA glioblastoma data [2] of four types: 
two gene expression measurements (El, E2) using Affy- 
metrix HT-HG-U133A and Agilent G4502A-07 micro- 
arrays respectively; and two CN measurements (CI, C2), 
both using the Agilent HG-CGH-244A platform but 
performed in two different labs [42]. For expression, we 
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use TCGA's normalization and gene summaries. For 
CN, we average TCGA-normalized probe values by 
gene; here we used the probe to gene mapping provided 
by TCGA in the Array Definition Files. For simulations 
we use 99 samples with randomly assigned binary phe- 
notypes (49 and 50 observations per class). To discover 
gene sets important in GBM survival we use an extreme 
discordant phenotype design [9]. GBM patients with 
survival time shorter than the lower quartile (190 days) 
are labeled STSs, and those with survival time longer 
than the upper quartile (594 days) LTSs. Such grouping 
enhances signal relevant to survival. After removing 
patients who developed a hypermutated phenotype due 
to radiation treatment, we retained 95 observations (47 
STS and 48 with LTS phenotypes). There are 12,042 
genes for El, 17,814 for E2, 21,814 for CI, and 23,167 
for C2. The total number of genes measured in all data 
types is 10,334, with 25,583 genes measured in at least 
one. 

For validation, we retrieved expression (Affymetrix 
U133-Plus-2.0) and CN (Affymetrix lOOK-SNP-Array) 
data for 1,275 genes involved in the pathways discovered 
in TCGA data analysis (see Results) from the 
Rembrandt database [8]. Gene summaries are obtained 
by averaging Rembrandt-preprocessed probe-level data. 
We used 44 STS samples (survival < 231 days), and 36 
LTS samples (survival < 775). 

Simulations 

We construct an empirical null scenario, where no 
genes are associated with the phenotype, by randomly 
assigning phenotype labels. This preserves correlations 
between genes within data types, and correlations 
between data types, yielding a realistic background dis- 
tribution of gene-to-phenotype association scores. We 
randomly select ten gene sets from a given gene set 
collection for 'spiking in'. For each spike-in set we ran- 
domly select a fraction of its genes to be associated 
with the phenotype. The number of genes that differ 
across phenotypes is sampled from the binomial distri- 
bution with parameter y, which reflects the desired 
expected fraction of genes in a set to be associated 
with the phenotype. For each selected gene we add Ak 
to the observations for one of the phenotypes; Ak is 
chosen so that, if the data were normal, with standard 
deviation equal to the average standard deviation 
among genes in the k-th spiked-in set, then the power 
for the two-sample £-test would have value p. We call 
P the signal strength. 

We study the ability of the methods to discover 
spiked-in sets as we vary p and y on a grid. The signal 
strength captures signal-to-noise ratio: at low values of 
this parameter, gene-to-phenotype association scores for 
the spiked-in genes will be more similar to their 



background null distribution. The expected fraction of 
altered genes captures enrichment of such genes in a 
spiked-in set. As this parameter grows, the distribution 
of gene-to-phenotype association scores within a spiked- 
in set is more likely to differ from the distribution of 
the scores for the rest of genes. We use a competitive 
gene set test to evaluate this difference. Each simulation 
scenario is repeated 1,000 times. Complete analysis on 
each simulation repetition takes approximately 10 min- 
utes wall-clock time on a MacBookPro 2.8 GHz Intel 
Core 2 Duo with 4 GB RAM. 

We compute the fraction of correctly discovered 
spiked-in sets that are discovered with a given data type 
but not by any of the other data types as: 



EF{i) = 1 



U ;eSi TPQ)nTP(z) 

I TP(i) I 



where i indexes data types: S = {E1,E2,C1,C2}, S_i is 
the set of all data types excluding the i-th, and TP(>) is 
the number of true positive sets among the top scoring 
ones. For each of the three integration methods we also 
compute the fraction of sets that are discovered exclu- 
sively by that method: 



EF*(Z) = 1 



U jeS TP{j)nTP(l) 
I TP(l) | 



where / e {INT,AvgPMinP}, and S and TP(-) are 
defined above. 

All computations were performed in the statistical 
software [5]. The R code used in our study is avail- 
able as in Additional file 2. The code and R data objects 
are also available for download online [43]. 

Additional material 



Additional file 1: Tables with additional results for TCGA GBM data. 

Table S1: P-values for the top 50 gene sets discovered by the integrative 
method (INT) using the competitive gene set test. Patients from the 
upper tertile of the survival distribution were labeled as long-term 
survivors, and those from the lower tertile short-term survivors. Table S2: 
P-values for the top 50 gene sets discovered by the integrative method 
(INT) using the competitive gene set test. A Cox regression model was 
used to establish gene-to-phenotype association scores. 

Additional file 2: R code for methods described in the paper. An 

archive containing the R implementation of the methods described in 
the paper, and used for simulations and data analysis. 
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